Data processing system, processor and method of data processing having branch target address cache storing direct predictions

ABSTRACT

In at least one embodiment, a processor includes at least one execution unit and instruction sequencing logic that fetches instructions for execution by the execution unit. The instruction sequencing logic includes branch logic that outputs predicted branch target addresses for use as instruction fetch addresses. The branch logic includes a branch target address cache (BTAC) having at least one direct entry providing storage for a direct branch target address prediction associating a first instruction fetch address with a branch target address to be used as a second instruction fetch address immediately after the first instruction fetch address and at least one indirect entry providing storage for an indirect branch target address prediction associating a third instruction fetch address with a branch target address to be used as a fourth instruction fetch address subsequent to both the third instruction fetch address and an intervening fifth instruction fetch address.

This invention was made with United States Government support under Agreement No. HR0011-07-9-0002 awarded by DARPA. The Government has certain rights in the invention.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates in general to data processing and, in particular, to branch prediction. Still more particularly, the present invention relates to a data processing system, processor and method of data processing with an improved branch target address cache (BTAC).

2. Description of the Related Art

A state-of-the-art microprocessor can comprise, for example, a cache for storing instructions and data, an instruction sequencing unit for fetching instructions from the cache, ordering the fetched instructions, and dispatching the fetched instructions for execution, one or more sequential instruction execution units for processing sequential instructions, and a branch processing unit (BPU) for processing branch instructions.

Branch instructions processed by the BPU can be classified as either conditional or unconditional branch instructions. Unconditional branch instructions are branch instructions that change the flow of program execution from a sequential execution path to a specified target execution path and which do not depend upon a condition supplied by the occurrence of an event. Thus, the branch specified by an unconditional branch instruction is always taken. In contrast, conditional branch instructions are branch instructions for which the indicated branch in program flow may be taken or not taken depending upon a condition within the processor, for example, the state of specified condition register bit(s) or the value of a counter.

Conditional branch instructions can be further classified as either resolved or unresolved based upon whether or not the condition upon which the branch depends is available when the conditional branch instruction is evaluated by the BPU. Because the condition upon which a resolved conditional branch instruction depends is known prior to execution, resolved conditional branch instructions can typically be executed and instructions within the target execution path fetched with little or no delay in the execution of sequential instructions. Unresolved conditional branches, on the other hand, can create significant performance penalties if fetching of sequential instructions is delayed until the condition upon which the branch depends becomes available and the branch is resolved.

Therefore, in order to minimize execution stalls, some processors speculatively predict the outcomes of unresolved branch instructions as taken or not taken. Utilizing the result of the prediction, the instruction sequencing unit is then able to fetch instructions within the speculative execution path prior to the resolution of the branch, thereby avoiding a stall in the execution pipeline in cases in which the branch is subsequently resolved as correctly predicted. Conventionally, prediction of unresolved conditional branch instructions has been accomplished utilizing static branch prediction, which predicts resolutions of branch instructions based upon criteria determined prior to program execution, or utilizing dynamic branch prediction, which predicts resolutions of branch instructions by reference to branch history accumulated on a per-address basis within a branch history table (BHT) and/or branch target address cache (BTAC).

Modern microprocessors require multiple cycles to fetch instructions from the instruction cache, scan the fetched instructions for branches, and predict the outcome of unresolved conditional branch instructions. If any branch is predicted as taken, instruction fetch is redirected to the new, predicted address. This process of changing which instructions are being fetched is called “instruction fetch redirect”. During the several cycles required for the instruction fetch, branch scan, and instruction fetch redirect, instructions continue to be fetched along the not-taken path; in the case of a predicted-taken branch, the instructions fetched along the not-taken path are discarded, resulting in decreased performance and wasted power dissipation.

Several existing approaches are utilized to reduce or to eliminate the instruction fetch redirect penalty. One commonly used method is the implementation of a BTAC that in each entry caches the branch target address of a taken branch in association with the branch instruction's tag. In operation, the BTAC is accessed in parallel with the instruction cache and is searched for an entry whose instruction tag matches the fetch address transmitted to the instruction cache. If such a BTAC entry exists, instruction fetch is redirected to the branch target address provided in the matching BTAC entry. Because the BTAC access typically takes fewer cycles than the instruction fetch, branch scan, and taken branch redirect sequence, a correct BTAC prediction can improve performance by causing instruction fetch to begin at a new address sooner than if there were no BTAC present.
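
By way of illustration only, the conventional BTAC lookup described above can be sketched in software as follows. The class name SimpleBTAC, the use of the full fetch address as the tag, and the 32-byte sequential step are assumptions made for this sketch and do not describe any particular hardware implementation.

```python
# Hypothetical software model of a conventional BTAC (illustrative only).
# Assumption: the whole fetch address serves as the tag, and a miss simply
# falls through to the next sequential 32-byte block.
class SimpleBTAC:
    def __init__(self):
        self.entries = {}  # fetch-address tag -> cached branch target address

    def insert(self, fetch_addr, target):
        """Cache the target of a taken branch against its fetch address."""
        self.entries[fetch_addr] = target

    def next_fetch(self, fetch_addr, block_size=32):
        """Return the redirect target on a hit, else the sequential address."""
        return self.entries.get(fetch_addr, fetch_addr + block_size)


btac = SimpleBTAC()
btac.insert(0x100, 0x200)
print(hex(btac.next_fetch(0x100)))  # hit: redirected to the cached target
print(hex(btac.next_fetch(0x200)))  # miss: sequential fetch continues
```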

However, in conventional designs, the BTAC access still generally requires multiple cycles, meaning that in the case of a BTAC hit at least one cycle elapses before the taken branch redirect. The interval between the BTAC access and the instruction fetch redirect represents a “bubble” during which no useful work is performed by the instruction fetch pipeline. Unfortunately, this interval tends to grow as processors achieve higher and higher operating frequencies and as BTAC sizes increase in response to the larger total number of instructions (i.e., “instruction footprint”) of newer software applications.

SUMMARY OF THE INVENTION

In at least one embodiment, a processor includes at least one execution unit and instruction sequencing logic that fetches instructions for execution by the execution unit. The instruction sequencing logic includes branch logic that outputs predicted branch target addresses for use as instruction fetch addresses. The branch logic includes a branch target address cache (BTAC) having at least one direct entry providing storage for a direct branch target address prediction associating a first instruction fetch address with a branch target address to be used as a second instruction fetch address immediately after the first instruction fetch address and at least one indirect entry providing storage for an indirect branch target address prediction associating a third instruction fetch address with a branch target address to be used as a fourth instruction fetch address subsequent to both the third instruction fetch address and an intervening fifth instruction fetch address.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an exemplary embodiment of a data processing system in accordance with the present invention;

FIG. 2 is a more detailed block diagram of the Branch Target Address Cache (BTAC) within the data processing system of FIG. 1;

FIG. 3 is a high level logical flowchart of an exemplary method by which a Branch Target Address Cache (BTAC) generates instruction fetch addresses in accordance with the present invention;

FIG. 4A is a high level logical flowchart of an exemplary method by which the branch target address predictions within the BTAC are updated by branch logic in accordance with the present invention; and

FIG. 4B is a high level logical flowchart of an exemplary method by which the branch target address predictions within the BTAC are updated in response to operation of a branch execution unit in accordance with the present invention.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT

With reference now to FIG. 1, there is illustrated a high level block diagram of an exemplary data processing system 8 in accordance with the present invention. As shown, data processing system 8 includes a processor 10 comprising a single integrated circuit superscalar processor, which, as discussed further below, includes various execution units, registers, buffers, memories, and other functional units that are all formed by integrated circuitry. Processor 10 may be coupled to other devices, such as a system memory 12 and a second processor 10, by an interconnect fabric 14 to form a data processing system 8 such as a workstation or server computer system. Processor 10 also includes an on-chip multi-level cache hierarchy including a unified level two (L2) cache 16 and bifurcated level one (L1) instruction (I) and data (D) caches 18 and 20, respectively. As is well known to those skilled in the art, caches 16, 18 and 20 provide low latency access to cache lines corresponding to memory locations in system memory 12.

Instructions are fetched and ordered for processing by instruction sequencing logic 13 within processor 10. In the depicted embodiment, instruction sequencing logic 13 includes an instruction fetch address register (IFAR) 30 that contains an effective address (EA) indicating a block of instructions (e.g., a 32-byte cache line) to be fetched from L1 I-cache 18 for processing. During each cycle, a new instruction fetch address (IFA) may be loaded into IFAR 30 from one of at least three sources: branch logic 36, which provides speculative branch target addresses resulting from the prediction of conditional branch instructions, global completion table (GCT) 38, which provides sequential path addresses, and branch execution unit (BEU) 92, which provides non-speculative addresses resulting from the resolution of predicted conditional branch instructions. The effective address loaded into IFAR 30 is selected from among the addresses provided by the multiple sources according to a prioritization scheme, which may take into account, for example, the relative priorities of the sources presenting addresses for selection in a given cycle and the age of any outstanding unresolved conditional branch instructions.

If hit/miss logic 22 determines, after translation of the EA contained in IFAR 30 by effective-to-real address translation (ERAT) 32 and lookup of the real address (RA) in I-cache directory 34, that the block of instructions corresponding to the EA in IFAR 30 does not reside in L1 I-cache 18, then hit/miss logic 22 provides the RA to L2 cache 16 as a request address via I-cache request bus 24. Such request addresses may also be generated by prefetch logic within L2 cache 16 or elsewhere within processor 10 based upon recent access patterns. In response to a request address, L2 cache 16 outputs a cache line of instructions, which are loaded into prefetch buffer (PB) 28 and L1 I-cache 18 via I-cache reload bus 26, possibly after passing through predecode logic (not illustrated).

Once the block of instructions specified by the EA in IFAR 30 resides in L1 I-cache 18, L1 I-cache 18 outputs the block of instructions to both branch logic 36 and instruction fetch buffer (IFB) 40. As described further below with respect to FIG. 2, branch logic 36 scans the block of instructions for branch instructions and predicts the outcome of conditional branch instructions in the instruction block, if any. Following a branch prediction, branch logic 36 furnishes a speculative instruction fetch address to IFAR 30, as discussed above, and passes the prediction to branch issue queue 64 so that the accuracy of the prediction can be determined when the conditional branch instruction is subsequently resolved by branch execution unit 92.

IFB 40 temporarily buffers the block of instructions received from L1 I-cache 18 until the block of instructions can be translated, if necessary, by an instruction translation unit (ITU) 42. In the illustrated embodiment of processor 10, ITU 42 translates instructions from user instruction set architecture (UISA) instructions (e.g., PowerPC® instructions) into a possibly different number of internal ISA (IISA) instructions that are directly executable by the execution units of processor 10. Such translation may be performed, for example, by reference to microcode stored in a read-only memory (ROM) template. In at least some embodiments, the UISA-to-IISA translation results in a different number of IISA instructions than UISA instructions and/or IISA instructions of different lengths than corresponding UISA instructions. The resultant IISA instructions are then assigned by global completion table 38 to an instruction group, the members of which are permitted to be executed out-of-order with respect to one another. Global completion table 38 tracks each instruction group for which execution has yet to be completed by at least one associated EA, which is preferably the EA of the oldest instruction in the instruction group.

Following UISA-to-IISA instruction translation, instructions are dispatched in-order to one of latches 44, 46, 48 and 50 according to instruction type. That is, branch instructions and other condition register (CR) modifying instructions are dispatched to latch 44, fixed-point and load-store instructions are dispatched to either of latches 46 and 48, and floating-point instructions are dispatched to latch 50. Each instruction requiring a rename register for temporarily storing execution results is then assigned one or more registers within a register file by the appropriate one of CR mapper 52, link and count (LC) register mapper 54, exception register (XER) mapper 56, general-purpose register (GPR) mapper 58, and floating-point register (FPR) mapper 60.

The dispatched instructions are then temporarily placed in an appropriate one of CR issue queue (CRIQ) 62, branch issue queue (BIQ) 64, fixed-point issue queues (FXIQs) 66 and 68, and floating-point issue queues (FPIQs) 70 and 72. From issue queues 62, 64, 66, 68, 70 and 72, instructions can be issued opportunistically (i.e., possibly out-of-order) to the execution units of processor 10 for execution. In some embodiments, the instructions are also maintained in issue queues 62-72 until execution of the instructions is complete and the result data, if any, are written back, in case any of the instructions needs to be reissued.

As illustrated, the execution units of processor 10 include a CR unit (CRU) 90 for executing CR-modifying instructions, a branch execution unit (BEU) 92 for executing branch instructions, two fixed-point units (FXUs) 94 and 100 for executing fixed-point instructions, two load-store units (LSUs) 96 and 98 for executing load and store instructions, and two floating-point units (FPUs) 102 and 104 for executing floating-point instructions. Each of execution units 90-104 is preferably implemented as an execution pipeline having a number of pipeline stages.

During execution within one of execution units 90-104, an instruction receives operands, if any, from one or more architected and/or rename registers within a register file coupled to the execution unit. When executing CR-modifying or CR-dependent instructions, CRU 90 and BEU 92 access the CR register file 80, which in a preferred embodiment contains a CR and a number of CR rename registers that each comprise a number of distinct fields formed of one or more bits. Among these fields are LT, GT, and EQ fields that respectively indicate if a value (typically the result or operand of an instruction) is less than zero, greater than zero, or equal to zero. Link and count register (LCR) register file 82 contains a count register (CTR), a link register (LR) and rename registers of each, by which BEU 92 may also resolve conditional branches to obtain a path address. General-purpose register files (GPRs) 84 and 86, which are synchronized, duplicate register files, store fixed-point and integer values accessed and produced by FXUs 94 and 100 and LSUs 96 and 98. Floating-point register file (FPR) 88, which like GPRs 84 and 86 may also be implemented as duplicate sets of synchronized registers, contains floating-point values that result from the execution of floating-point instructions by FPUs 102 and 104 and floating-point load instructions by LSUs 96 and 98.

After an execution unit finishes execution of an instruction, the execution unit notifies GCT 38, which schedules completion of instructions in program order. To complete an instruction executed by one of CRU 90, FXUs 94 and 100 or FPUs 102 and 104, GCT 38 signals the appropriate mapper, which sets an indication to indicate that the register file register(s) assigned to the instruction now contains the architected state of the register. The instruction is then removed from the issue queue, and once all instructions within its instruction group have completed, is removed from GCT 38. Other types of instructions, however, are completed differently.

When BEU 92 resolves a conditional branch instruction and determines the path address of the execution path that should be taken, the path address is compared against the speculative path address predicted by branch logic 36. If the path addresses match, branch logic 36 updates its prediction facilities, if necessary. If, however, the calculated path address does not match the predicted path address, BEU 92 supplies the correct path address to IFAR 30, and branch logic 36 updates its prediction facilities, as described further below. In either event, the branch instruction can then be removed from BIQ 64, and when all other instructions within the same instruction group have completed, from GCT 38.

Following execution of a load instruction (including a load-reserve instruction), the effective address computed by executing the load instruction is translated to a real address by a data ERAT (not illustrated) and then provided to L1 D-cache 20 as a request address. At this point, the load operation is removed from FXIQ 66 or 68 and placed in load data queue (LDQ) 114 until the indicated load is performed. If the request address misses in L1 D-cache 20, the request address is placed in load miss queue (LMQ) 116, from which the requested data is retrieved from L2 cache 16, and failing that, from another processor 10 or from system memory 12.

Store instructions (including store-conditional instructions) are similarly completed utilizing a store queue (STQ) 110 into which effective addresses for stores are loaded following execution of the store instructions. From STQ 110, data can be stored into either or both of L1 D-cache 20 and L2 cache 16, following effective-to-real translation of the target address.

Referring now to FIG. 2, there is depicted a more detailed block diagram of an exemplary embodiment of branch logic 36 of FIG. 1 in relation to other components of instruction sequencing logic 13. In the illustrated embodiment, branch logic 36 includes a historical instruction fetch address (IFA) buffer 160 that buffers one or more previous values of IFAR 30 (if available), an instruction decoder 128, branch direction prediction circuitry, such as branch history table (BHT) 130, and branch target address prediction circuitry, such as branch target address cache (BTAC) 132. In alternative embodiments of the present invention, the branch direction prediction circuitry can be implemented utilizing any other type of branch direction prediction circuitry, including without limitation, static branch prediction circuitry or two-level dynamic branch prediction circuitry. In addition, the branch target address prediction circuitry can also be implemented utilizing other known or future developed branch target address prediction circuitry, such as a branch target buffer (BTB). Further, in some embodiments, the physical structures utilized for branch direction prediction and branch target address prediction may be merged. The present invention is equally applicable to all such embodiments.

Instruction decoder 128 is coupled to receive each cache line of instructions as it is fetched from L1 I-cache 18 and placed in instruction fetch buffer 40. Instruction decoder 128 scans each cache line of instructions for branch instructions, and in response to detecting a branch instruction, forwards the branch instruction to the branch direction prediction circuitry (e.g., BHT 130) for direction prediction. As further indicated by the connection between BHT 130 and instruction fetch buffer 40, in the event BTAC 132 invokes fetching along a path that BHT 130 predicts as not-taken, BHT 130 cancels the instructions in the incorrect path from instruction fetch buffer 40 and redirects fetching along the correct path.

In accordance with the present invention, the branch target address prediction circuitry (hereinafter, referred to as BTAC 132) includes an indirect BTAC 150 having an N-cycle access latency (e.g., two cycles) that stores addresses of instruction blocks to be fetched N processor clock cycles later. For example, in embodiments in which N=2, indirect BTAC 150 stores addresses of instruction blocks to be fetched in the processor clock cycle following fetching of the next instruction block. Thus, for the instruction address sequence 0x100, 0x120, 0x200, indirect BTAC 150 would store an entry associating instruction address 0x100 with predicted branch target address 0x200.
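
The N=2 example above can be reproduced with a short software sketch. The function name train_indirect, the dictionary used in place of entries 142, and the rule that an entry is recorded only when the later address is not the sequential successor of the intervening address are assumptions made for illustration; only the address sequence and the resulting association are taken from the description.

```python
# Hypothetical sketch of how indirect (N=2) predictions relate to a fetch
# address sequence. Assumption: a redirect is any step that is not the
# sequential successor of the intervening 32-byte block.
def train_indirect(fetch_sequence, block_size=32):
    """Map each fetch address to the redirect target fetched two steps later."""
    entries = {}
    triples = zip(fetch_sequence, fetch_sequence[1:], fetch_sequence[2:])
    for first, intervening, later in triples:
        if later != intervening + block_size:
            # The entry is keyed by the address fetched before the
            # intervening block, not by the block containing the branch.
            entries[first] = later
    return entries


entries = train_indirect([0x100, 0x120, 0x200])
print({hex(tag): hex(bta) for tag, bta in entries.items()})
```

Because the entry is keyed by 0x100, a lookup that begins when 0x100 is fetched and takes two cycles can deliver 0x200 in time to follow the fetch of 0x120.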

In the depicted embodiment, indirect BTAC 150 includes multiple entries 142, each including a tag (T) field 144 for storing a tag portion of an instruction fetch address (IFA), a branch target address (BTA) field 146 for storing a BTA, and a state (S) field 148 indicating state information for the entry 142. In various embodiments, state field 148 may simply indicate validity of its entry 142 or may alternatively or additionally provide additional information, such as the type of entry and/or a score indicating a confidence in the correctness of the BTA.

As discussed further below, the ability to establish indirect predictions in BTAC 150 depends upon the availability of previous instruction fetch addresses, for example, in historical IFA buffer 160. However, under some operating conditions, for example, following machine reboot, context switch or pipeline flush, historical IFA buffer 160 is empty and therefore cannot provide the previous instruction fetch address. For performance reasons, it is nevertheless desirable under such operating conditions to establish BTA predictions in BTAC 132.

Accordingly, a BTAC 132 in accordance with the present invention further includes storage for direct predictions, meaning predicted addresses of the next instruction blocks to be fetched. Storage for direct predictions can be implemented in BTAC 132 in a number of different ways. For example, in some embodiments, one or more direct predictions are stored in one or more entries 142 of indirect BTAC 150 and marked in state field 148 as direct predictions. Alternatively, or additionally, one or more entries 142 in indirect BTAC 150 may be exclusively allocated for direct predictions, or direct predictions may be stored in a way of a set-associative BTAC containing at least one way for indirect predictions and at least one way for direct predictions. Further, as shown in FIG. 2, BTAC 132 may include dedicated storage for direct predictions in a direct BTAC 140. As indicated by like reference numerals, direct BTAC 140, like indirect BTAC 150, includes multiple entries 142, each including a tag (T) field 144 for storing a tag portion of an instruction fetch address (IFA), a branch target address (BTA) field 146 for storing a BTA, and a state (S) field 148 indicating state information for the entry 142.

In operation, direct BTAC 140 and indirect BTAC 150 are accessed by the tag 162 of the IFA in IFAR 30 in parallel with the access to L1 I-cache 18. If tag 162 of the IFA in IFAR 30 does not match the contents of any tag field 144 of any entry 142 in direct BTAC 140 or indirect BTAC 150, direct BTAC 140 and indirect BTAC 150 deassert their respective hit signals 152 a and 152 b. If, on the other hand, tag 162 matches the contents of a tag field 144 of an entry 142 in one or both of direct BTAC 140 and indirect BTAC 150, each BTAC in which a hit occurs asserts its hit signal 152 a or 152 b and outputs the BTA associated with the matching tag field 144. If a hit occurs in direct BTAC 140, the BTA is qualified at a first buffer 154 a by the state information within the state field 148 of the matching entry 142, and if successfully qualified, is presented to IFAR 30 for selection. If a hit alternatively or additionally occurs in indirect BTAC 150, the BTA is qualified at a second buffer 154 b by the state information within the state field 148 of the matching entry 142 and additionally by the deassertion of hit signal 152 a (as indicated by inverter 158 and AND gate 156). If the BTA output by indirect BTAC 150 is successfully qualified, second buffer 154 b presents the BTA output by indirect BTAC 150 to IFAR 30. Thus, if direct BTAC 140 and indirect BTAC 150 both output a branch target address prediction in a given processor clock cycle, the result output by direct BTAC 140 is used.
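
The qualification and priority logic described above may be sketched as follows. This is a minimal software model under stated assumptions: the names Entry and select_bta are hypothetical, the full address stands in for tag 162, a single valid flag stands in for state field 148, and the N-cycle timing of the indirect prediction is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    bta: int            # branch target address (BTA field 146)
    valid: bool = True  # assumed stand-in for state field 148


def select_bta(tag: int,
               direct: Dict[int, Entry],
               indirect: Dict[int, Entry]) -> Optional[int]:
    """Return the BTA presented to the IFAR, or None if nothing qualifies."""
    d = direct.get(tag)
    i = indirect.get(tag)
    hit_a = d is not None  # models hit signal 152 a
    hit_b = i is not None  # models hit signal 152 b
    if hit_a and d.valid:
        return d.bta       # direct prediction qualified by its state
    # The indirect BTA is gated by its own state AND by the deassertion of
    # hit signal 152 a (the inverter and AND gate in the description).
    if hit_b and i.valid and not hit_a:
        return i.bta
    return None


direct = {0x100: Entry(0x200)}
indirect = {0x100: Entry(0x300), 0x140: Entry(0x400)}
print(hex(select_bta(0x100, direct, indirect)))  # both hit: direct is used
print(hex(select_bta(0x140, direct, indirect)))  # only the indirect entry hits
```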

BTAC 132 is updated, as needed, when branch prediction is performed. As shown, hit signals 152 a, 152 b from direct BTAC 140 and indirect BTAC 150 are passed to the branch direction prediction circuitry (e.g., BHT 130). If the result of the branch direction prediction is not an instruction fetch redirect (i.e., the branch is predicted as not taken) and branch instruction tag 162 hit in one or both of direct BTAC 140 and indirect BTAC 150, BHT 130 sends an invalidation request to remove the incorrect branch target address prediction from each of BTAC 140 and BTAC 150. Alternatively, if the result of the branch direction prediction is an instruction fetch redirect and branch instruction tag 162 missed in both direct BTAC 140 and indirect BTAC 150, BHT 130 sends an insertion request to a selected one of BTACs 140, 150 to request insertion of a new entry 142 associating the IFA with the branch target address to which fetching was redirected. In a preferred embodiment, when a BTAC insertion request is generated, the request is sent to BTAC 140 or BTAC 150 based upon whether a corresponding IFA resides in historical IFA buffer 160. Thus, BHT 130 transmits the insertion request to indirect BTAC 150 if the IFA immediately preceding the IFA of the predicted branch is still buffered in historical IFA buffer 160 when the insertion request is generated. Otherwise, BHT 130 directs the insertion request to direct BTAC 140. Further details regarding the operation of BTAC 132 are described below with reference to FIGS. 3-4B.

With reference now to FIG. 3, there is illustrated a high level logical flowchart of an exemplary method by which BTAC 132 generates and outputs speculative branch target addresses (BTAs) in accordance with the present invention. The depicted process begins at block 300 and then proceeds to block 302, which illustrates BTAC 132 receiving the tag 162 of the instruction fetch address (IFA) in IFAR 30 concurrently with the transmission of the IFA to L1 I-cache 18 to initiate a fetch of an instruction block. In response to receipt of tag 162 by BTAC 132, indirect BTAC 150 and direct BTAC 140 (or other storage for direct BTA predictions) are accessed concurrently to determine if tag 162 hits in an entry 142, that is, whether the tag 162 matches the contents of any of tag fields 144 of entries 142.

If the current tag 162 hits in direct BTAC 140 as shown at block 306 (and the resulting hit signal 152 a is successfully qualified by the contents of the state field 148 of the matching entry 142), the process proceeds to block 312. Block 312 depicts BTAC 132 presenting to IFAR 30 the predicted branch target address output by direct BTAC 140. However, if the current tag 162 misses in direct BTAC 140 and hits in indirect BTAC 150 as shown at block 308 (and the resulting hit signal 152 b is successfully qualified by the contents of the state field 148 of the matching entry 142), the process proceeds to block 310. Block 310 illustrates BTAC 132 presenting to IFAR 30 the predicted branch target address output by indirect BTAC 150 N processor clock cycles following receipt by BTAC 132 of the previous tag 162. Following either block 310 or block 312, or following a negative determination at both of blocks 306 and 308, the process illustrated in FIG. 3 terminates until a next tag 162 is received by BTAC 132.

Referring now to FIG. 4A, there is illustrated a high level logical flowchart that depicts an exemplary method by which the branch target address predictions within BTAC 132 are updated in accordance with the present invention. The process begins at block 400 of FIG. 4A and then passes to block 402, which depicts branch logic 36 determining whether or not a block of instructions (e.g., a 32-byte cache line) fetched from L1 I-cache 18 includes a branch instruction. If not, no update to BTAC 132 is made. The process shown in FIG. 4A therefore passes from block 402 to block 440, which depicts branch logic 36 saving the IFA of the instruction fetch block as the previous IFA within historical IFA buffer 160. The process thereafter terminates at block 441 until a subsequent instruction block is fetched.

Returning to block 402, if branch logic 36 determines at block 402 that the fetched instruction block includes a branch instruction, the process proceeds to block 410. Block 410 depicts branch logic 36 determining whether the fetched instruction block contains an unconditional taken branch or a conditional branch predicted as “taken” by BHT 130. If so, the process proceeds from block 410 to block 420, which is described below. If not, the process passes to block 412, which depicts branch logic 36 determining from hit signals 152 a, 152 b whether the tag 162 of the IFA hit in one or more of direct BTAC 140 and indirect BTAC 150. If not, no update to BTAC 132 is made, and the process passes from block 412 to blocks 440-441, which have been described. If, however, a determination is made at block 412 that tag 162 hits in one or more of direct BTAC 140 and indirect BTAC 150, meaning that BTAC 132 has at least one entry predicting a redirect target address for a fetched instruction block containing no branch that would cause a fetch redirect, branch logic 36 invalidates each entry 142 in BTAC 132 (i.e., in direct BTAC 140 and/or indirect BTAC 150) with matching tag 162 (block 414). Such invalidation may be performed, for example, by updating the state field(s) of the relevant entry or entries 142. Thereafter, the process passes to blocks 440-441, which have been described.

Referring now to block 420, if branch logic 36 determines that a branch instruction within the fetched instruction block was either unconditionally “taken” or predicted as “taken” and tag 162 hit in BTAC 132, the process proceeds to block 430, which is described below. If, however, branch logic 36 determines at block 420 that a branch instruction within the fetched instruction block was “taken” and tag 162 missed in BTAC 132, the process proceeds to block 422. Block 422 illustrates branch logic 36 determining whether historical IFA buffer 160 buffers the previous IFA immediately preceding the IFA that generated the fetch of the instruction block containing the taken branch instruction in question. As noted above, historical IFA buffer 160 may not buffer the previous IFA for a number of reasons, including occurrence of a reboot of the machine, a context switch, or a pipeline flush.

If branch logic 36 determines at block 422 that the previous IFA is not available, branch logic 36 inserts within direct BTAC 140 a new entry 142 containing the tag portion of the current IFA in tag field 144 and the branch target address predicted by BHT 130 in BTA field 146. If, on the other hand, branch logic 36 determines at block 422 that historical IFA buffer 160 still retains the previous IFA immediately preceding the current IFA that generated the fetch of the instruction block containing the conditional branch instruction in question, branch logic 36 inserts within indirect BTAC 150 a new entry 142 containing the tag portion of the previous IFA in tag field 144 and the branch target address predicted by BHT 130 in BTA field 146. Following either of blocks 424 and 426, the process passes to blocks 440-441, which have been described.
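The choice between a direct and an indirect insertion described above may be sketched as follows. The sketch is illustrative only and rests on assumptions that are not part of the disclosure: each BTAC is modeled as a Python dictionary keyed by tag, the function name `insert_prediction` is hypothetical, and an unavailable previous IFA is represented by `None`.

```python
def insert_prediction(direct, indirect, current_tag, predicted_bta,
                      previous_tag=None):
    """Insert a new prediction after a taken branch whose tag missed.

    If no previous IFA is available (previous_tag is None), the entry is
    keyed by the current IFA's tag and placed in the direct BTAC model.
    Otherwise the entry is keyed by the previous IFA's tag and placed in
    the indirect BTAC model. Returns which model received the entry.
    """
    if previous_tag is None:
        direct[current_tag] = predicted_bta
        return "direct"
    indirect[previous_tag] = predicted_bta
    return "indirect"


direct_btac, indirect_btac = {}, {}
first = insert_prediction(direct_btac, indirect_btac, 0x2A, 0x4000)
second = insert_prediction(direct_btac, indirect_btac, 0x2B, 0x5000,
                           previous_tag=0x2A)
```

Keying the indirect entry by the previous IFA is what allows the prediction to be looked up one fetch earlier than the branch-containing block itself.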

With reference now to block 430, if branch logic 36 determines that the fetched instruction block contains a taken branch and tag 162 hit in BTAC 132, branch logic 36 further determines whether the BTA prediction is confirmed as correct by BHT 130. If so, no update to BTAC 132 is required, and the process proceeds to blocks 440-441, which have been described. If, however, BHT 130 indicates at block 430 that the BTA predicted by BTAC 132 was incorrect, branch logic 36 updates the BTA field 146 of the entry 142 that provided the incorrect BTA prediction with the correct BTA. Thereafter, the process proceeds to blocks 440-441, which have been described.
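The decisions made at blocks 410 through 430 for a fetched instruction block that includes a branch instruction may be summarized by the following sketch. It is offered as an illustration under stated assumptions rather than as the disclosed logic: the function name `btac_update_action`, its boolean parameters and the returned action labels are hypothetical, and the hardware signals are reduced to simple flags.

```python
def btac_update_action(taken, tag_hit, prev_ifa_buffered=False,
                       bta_correct=True):
    """Return a label for the BTAC update selected for a fetched
    instruction block that includes a branch instruction.

    taken             -- unconditional taken, or predicted "taken"
    tag_hit           -- the IFA tag hit in the direct or indirect BTAC
    prev_ifa_buffered -- the historical IFA buffer holds the previous IFA
    bta_correct       -- the BTAC's predicted target was confirmed
    """
    if not taken:
        # Blocks 412 and 414: a hit without a redirecting branch is stale.
        return "invalidate" if tag_hit else "none"
    if tag_hit:
        # Block 430: keep a confirmed prediction, correct a wrong one.
        return "none" if bta_correct else "update_bta"
    # Block 422: on a miss, the buffer contents select the entry type.
    return "insert_indirect" if prev_ifa_buffered else "insert_direct"


actions = [
    btac_update_action(taken=False, tag_hit=True),
    btac_update_action(taken=True, tag_hit=True, bta_correct=False),
    btac_update_action(taken=True, tag_hit=False, prev_ifa_buffered=True),
]
```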

With reference now to FIG. 4B, there is illustrated a high level logical flowchart of an exemplary process by which BEU 92 updates branch target address predictions in branch unit 36, if necessary, in response to the detection of branch mispredictions. The process begins at block 450 and then proceeds to block 452, which depicts BEU 92 determining whether the correct behavior of a branch is “taken.” If a determination is made at block 452 that the branch resolved by BEU 92 is “not taken”, the process proceeds to block 454. Block 454 depicts BEU 92 invalidating each entry 142 within direct BTAC 140 whose tag field 144 matches the tag portion 162 of the IFA of the instruction fetch block including the resolved branch. If a determination is made at block 452 that the branch resolved by BEU 92 is “taken”, the process proceeds to block 456. Block 456 depicts BEU 92 inserting within direct BTAC 140 a new entry 142 containing the tag portion of the IFA of the instruction fetch block including the resolved branch in tag field 144 and containing the correct branch target address calculated by BEU 92 in BTA field 146. Thereafter, the process shown in FIG. 4B ends at block 460. Concurrently with the operations shown at blocks 452-456, BEU 92 also places the IFA of the instruction fetch block containing the resolved branch in historical IFA buffer 160, displacing a previous IFA if necessary, as shown at block 458.
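The process of FIG. 4B may be modeled, for purposes of illustration only, as shown below. The sketch assumes a dictionary keyed by tag for direct BTAC 140 and a bounded `collections.deque` for historical IFA buffer 160; the function name `resolve_branch`, the buffer depth of two, and the recording of tags rather than full IFAs are assumptions made for brevity. The concurrent operation of block 458 is modeled sequentially.

```python
from collections import deque


def resolve_branch(direct, history, ifa_tag, taken, correct_bta=None):
    """Apply the update for a branch resolved after a misprediction."""
    if taken:
        # Block 456: install the target calculated at execution.
        direct[ifa_tag] = correct_bta
    else:
        # Block 454: drop any direct prediction for this fetch block.
        direct.pop(ifa_tag, None)
    # Block 458: record the fetch address; a full deque created with
    # maxlen discards its oldest element on append.
    history.append(ifa_tag)


direct_btac = {0x01: 0x100}
history = deque(maxlen=2)  # assumed depth, for illustration
resolve_branch(direct_btac, history, 0x01, taken=False)
resolve_branch(direct_btac, history, 0x02, taken=True, correct_bta=0x200)
resolve_branch(direct_btac, history, 0x03, taken=True, correct_bta=0x300)
```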

As has been described, the present invention provides a data processing system, processor and method of data processing in which an improved branch target address cache (BTAC) is utilized to generate branch target address predictions. In accordance with the present invention, the BTAC includes storage for both direct and indirect branch target address predictions. In a preferred embodiment, the direct address prediction is given precedence over an indirect address prediction for the same instruction address.
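The precedence of a direct prediction over an indirect prediction for the same instruction address may be illustrated by the following sketch. It is a simplified software model under stated assumptions: both BTACs are dictionaries keyed by tag, and the function name `predict_target` and the returned (kind, address) pair are hypothetical.

```python
def predict_target(direct, indirect, tag):
    """Return (kind, target) for the given tag, preferring a direct
    prediction when both models hold an entry; None on a miss in both."""
    if tag in direct:
        return ("direct", direct[tag])
    if tag in indirect:
        return ("indirect", indirect[tag])
    return None


direct_btac = {0x10: 0x400}
indirect_btac = {0x10: 0x900, 0x20: 0xA00}
both_hit = predict_target(direct_btac, indirect_btac, 0x10)
indirect_only = predict_target(direct_btac, indirect_btac, 0x20)
```

In this model, an indirect result identifies the fetch address to be used after one intervening fetch address, whereas a direct result identifies the immediately following fetch address.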

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

1. A processor, comprising: at least one execution unit that executes instructions; and instruction sequencing logic, coupled to the at least one execution unit, that fetches instructions from a memory system for execution by the at least one execution unit, said instruction sequencing logic including a branch logic that outputs predicted branch target addresses for use as instruction fetch addresses, said branch logic including a branch target address cache (BTAC) having: at least one direct entry providing storage for a direct branch target address prediction associating a first instruction fetch address with a branch target address to be used as a second instruction fetch address immediately after the first instruction fetch address; and at least one indirect entry providing storage for an indirect branch target address prediction associating a third instruction fetch address with a branch target address to be used as a fourth instruction fetch address subsequent to both the third instruction fetch address and an intervening fifth instruction fetch address.
2. The processor of claim 1, wherein said at least one direct entry and said at least one indirect entry reside in a same cache array.
3. The processor of claim 1, wherein: said at least one direct entry and said at least one indirect entry reside in different ways of a set-associative cache array.
4. The processor of claim 1, wherein: said at least one direct entry and said at least one indirect entry each includes a status field identifying said at least one direct entry and said at least one indirect entry as containing a direct branch target address prediction and an indirect branch target address prediction, respectively.
5. The processor of claim 1, wherein if the first instruction fetch address matches the third instruction fetch address, said branch logic employs the direct branch target address prediction rather than the indirect branch target address prediction.
6. The processor of claim 1, wherein: the at least one execution unit includes a branch execution unit that executes branch instructions; the branch execution unit reports outcomes of executed branch instructions to the branch logic; and the branch logic inserts a new branch target address prediction into the BTAC based upon the reported outcome of an executed branch instruction.
7. The processor of claim 6, wherein: the branch logic includes a buffer that holds at least one previous instruction fetch address; and the branch logic inserts either an indirect entry or a direct entry based upon availability in the buffer of a particular previous instruction fetch address corresponding to the new branch target address prediction.
8. The processor of claim 1, wherein: the memory system includes a cache memory within the processor; and the instruction sequencing logic accesses the BTAC and the cache memory concurrently with the first instruction fetch address.
9. A data processing system, comprising: at least one processor in accordance with claim 1; an interconnect coupled to the processor; and the memory system coupled to the processor via the interconnect and operable to communicate data with the at least one processor.
10. A method of data processing in a processor including at least one execution unit and an instruction sequencing logic containing branch logic, the branch logic including a branch target address cache (BTAC), said method comprising: in the BTAC, holding: at least one direct entry providing storage for a direct branch target address prediction associating a first instruction fetch address with a branch target address to be used as a second instruction fetch address immediately after the first instruction fetch address; and at least one indirect entry providing storage for an indirect branch target address prediction associating a third instruction fetch address with a branch target address to be used as a fourth instruction fetch address subsequent to both the third instruction fetch address and an intervening fifth instruction fetch address; fetching instructions from a memory system for execution by at least one execution unit of the processor; the branch logic accessing the BTAC with at least a tag portion of a first instruction fetch address; and in response to said accessing, outputting either said second instruction fetch address or said fourth instruction fetch address.
11. The method of claim 10, wherein said holding comprises holding said at least one direct entry and said at least one indirect entry reside in a same cache array.
12. The method of claim 10, wherein said holding comprises holding said at least one direct entry and said at least one indirect entry reside in different ways of a set-associative cache array.
13. The method of claim 10, and further comprising: identifying said at least one direct entry and said at least one indirect entry as containing a direct branch target address prediction and an indirect branch target address prediction, respectively, with a status field.
14. The method of claim 10, wherein said outputting comprises: if the first instruction fetch address matches the third instruction fetch address, outputting the second instruction fetch address.
15. The method of claim 10, and further comprising: a branch execution unit executing branch instructions; the branch execution unit reporting outcomes of executed branch instructions to the branch logic; and the branch logic inserting a new branch target address prediction into the BTAC based upon the reported outcome of an executed branch instruction.
16. The method of claim 15, wherein: the branch logic includes a buffer that holds at least one previous instruction fetch address; and said inserting comprises the branch logic inserting either an indirect entry or a direct entry based upon availability in the buffer of a particular previous instruction fetch address corresponding to the new branch target address prediction.
17. The method of claim 10, wherein: said memory system includes a cache memory; and said accessing comprises the instruction sequencing logic accessing the BTAC and a cache memory concurrently with the first instruction fetch address.